Hash sort: A linear time complexity multiple-dimensional sort algorithm
Abstract
Sorting and hashing are two completely different concepts in computer science, and appear mutually exclusive to one another. Hashing is a search method that uses a data item itself as a key to map it to a location in memory, and is used for rapid storage and retrieval. Sorting is the process of organizing data from a random permutation into an ordered arrangement, and is a common activity performed frequently in a variety of applications. Almost all conventional sorting algorithms work by comparison, and in doing so have a linearithmic greatest lower bound on their time complexity. Any improvement in the theoretical time complexity of a sorting algorithm leads to much larger gains in speed for the applications that use it. To exceed the linearithmic boundary on algorithmic performance, a sort algorithm must order the data by some means other than comparison. The hash sort is a general purpose, non-comparison sorting algorithm based on hashing, with some interesting features not found in conventional sorting algorithms. The hash sort asymptotically outperforms the fastest traditional sorting algorithm, the quick sort, and has a linear time complexity even in the worst case. The hash sort opens an area for further work and investigation into alternative means of sorting.

(Special thanks and appreciation to Dr. Michael Mascagni for the opportunity to present and publish this paper.)

1. Theory.

1.1. Sorting. Sorting is a common processing activity for computers to perform. Sorted data is arranged so that data items have increasing value (ascending) or decreasing value (descending); in either form, sorting establishes an ordering of the data. Arranging data from some random configuration into an ordered one is needed in many algorithms, applications, and programs, which makes the space and temporal complexity of the sorting algorithm used paramount. A bad choice of sorting algorithm by a designer or programmer can result in mediocre performance in the end. Given the widespread need for sorting and the importance of performance, many different types and kinds of sorting algorithms have been devised. Some, such as quick sort and bubble sort, are widespread and often used; others, such as bin sort and pigeonhole sort, are not as widely known.

The plethora of available sorting algorithms does not change the tantamount question of how fast it is possible to sort. This is a question of temporal efficiency, and it is the most significant criterion for a sort algorithm; what affects temporal efficiency, and why, is just as important a concern. Another important, though more subtle, concern is the space required to use the algorithm: space efficiency, a matter of the overhead and resource requirements involved in the sorting process.

For comparison-based sorting, temporal efficiency and the causes of changes in it are well understood, and there is a greatest lower bound on sorting time of O(N log N). In essence, the reasoning is that sorting is a comparative, decision-making process over N items, which forms a binary decision tree of depth log N. For each data item, a decision must be made about where to move it to maintain the desired ordering property. With N data items and log N decision time per item, the minimum time possible is their product, N log N. This lower bound is rarely reached in practice; an actual algorithm is usually a multiplicative or additive constant away from the theoretical bound.
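The decision-tree reasoning above can be made precise with the standard counting argument; a brief sketch, for reference (this derivation is standard and not part of the paper's own text):

```latex
% A comparison sort must distinguish all N! input permutations, so its
% binary decision tree needs at least N! leaves; a binary tree of
% depth d has at most 2^d leaves.
\[
  2^{d} \ge N! \quad\Longrightarrow\quad d \ge \log_2 (N!)
\]
% Stirling's approximation then gives the linearithmic bound:
\[
  \log_2 (N!) = N \log_2 N - N \log_2 e + O(\log N) = \Omega(N \log N).
\]
```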
There is no theoretical greatest lower bound for space efficiency, and this complexity measure often characterizes a sort algorithm. The space requirements are highly dependent on the underlying method used in the sort algorithm, so space efficiency directly reflects it. While no theoretical limit is available, an optimal sort algorithm requires N + 1 storage: the N data items, plus one additional item used by the sort algorithm. The bubble sort has this optimal storage efficiency. However, optimal space efficiency is subordinate to temporal efficiency in a sorting algorithm: the bubble sort, while optimal in space, is well reputed to perform very poorly, far above the theoretical lower bound on time.

1.2. Hashing. Hashing is a process of searching through data, using each data item itself as a key in the search. Hashing does not order the data items with respect to one another; it organizes the data according to a mapping function used to hash each data item. Hashing is an efficient method for organizing and searching data quickly. The temporal efficiency, or speed, of the hashing process is determined by the hash function and its complexity. Hash functions are mathematically based: a common hash function uses the remainder (mod) operation to map data items, and other hash functions are based on other mathematical formulas and equations. A hash function is usually constructed from multiplicative, divisional, or additive operations, or some mix of them. The choice of hash function follows from the data items and involves compromises between temporal and spatial organization.

Hashing is not a direct mathematical mapping of the data items into a spatial organization. Hash functions usually exhibit the phenomenon of a hash collision, or clash, where two data items map to the same location. This is detrimental to the hash function, and many techniques for handling hash collisions exist. Hash collisions introduce temporal inefficiency into a hash algorithm, as handling a collision takes additional time to re-map the data item. A special type of hash function, the perfect hash function, is perfect in the sense that it will not have collisions. Perfect hash functions are often available only for narrow applications, for example the reserved words in a compiler symbol table; they usually involve restricted data items and do not fit into a general purpose hash method. So while perfect hashing is possible, a regular hash function is often used with data for which the number of collisions is few or the collision resolution method is simple.
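To make the remainder-style hashing concrete, here is a minimal sketch; the table size and data values are hypothetical, and chaining is just one of the many collision resolution techniques the text alludes to:

```python
# Hash table of n slots using h(x) = x mod n, with chaining to
# absorb collisions (hypothetical table size and data values).
n = 10
table = [[] for _ in range(n)]

for x in [23, 41, 77, 51]:
    table[x % n].append(x)  # map each item by its remainder

print(table[3])  # [23]     -- 23 maps alone to slot 3
print(table[1])  # [41, 51] -- a collision: both leave remainder 1
```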
2. Algorithm.

2.1. Super-Hash Function. The heart of the hash sort algorithm is the concept of combining hashing with sorting. This key concept is embodied in the notion of a super-hash function. A super-hash function is "super" in that it is a hash function that has a consistent ordering and is a perfect hash function. A hash function of this type orders the data by hashing, but maintains a consistent ordering and does not scatter the data. It is perfect, so that as the data is ordered each item is placed uniquely when hashed: no data items ambiguously share a location in the ordered data, and the need for hash collision resolution is avoided altogether. The super-hash function is a generalized function, in that it is extendible mathematically; it is not a specialized type of hash function.

The super-hash function operates on a data set of positive integer values. The restrictions on the data set as the domain are that it lies within a bounded range, between a minimum and a maximum value, and that each integer value is unique, with no duplicate data values. This highly closed set of positive integer values is necessary to build a preliminary super-hash function; once a mathematically proven super-hash function is formulated, other less restricted data sets can be explored. A super-hash function as described, and within these parameters, is not a complex or rigorous function to define. It uses the standard hash function built on the modulus, or residue, operator: integer remainder, called mod, in the common form (x mod n). The other part of the super-hash function is another hash function called a mash function, for modified hash or magnitude hash. The mash function uses the integer division operator, called div, in the similar form (x div n). Together, the mash function and the hash function form the super-hash function, which is mathematically based and extensible, and is also a perfect hash function. For the super-hash function to be perfect, it must be an injective mapping.

The super-hash function works using a regular hash function and the mash function in combination, though not as the composition of the two; both are sub-functions of the super-hash function. The regular hash function (x mod n) works using the remainders, or residues, of values. Numbers are of the form c · x + r, where r is the remainder obtained with the regular hash function. When hashing by a value n, the resulting hashes map into the range 0 to n − 1. In essence, a set of values is formed so that each value in the set is of the form {c · x + 0, c · x + 1, . . . , c · x + (n − 2), c · x + (n − 1)}. A hash function provides some distinction among the values, using the remainder or residue of the value. However, regular hashing experiences collisions, where distinct values hash to the same result: values with the same remainder are indistinguishable to the regular hash function. Given a particular remainder r, all values that are multiples of c are equivalent, forming a set of the form {c1 · x + r, c2 · x + r, . . . , cn−1 · x + r, cn · x + r}. So for n = 10, r = 1, the following values are all equivalent under the regular hash function: {1, 11, 21, 31, . . . , 1001, 10001, c · x + 1}. It is this equivalence of values under regular hashing that causes collisions, as distinct values map to the same hash result. There is no distinction between larger and smaller values within the same hash result. Some partitioning or separation of the values is obtained, but what is needed is another hash function that further distinguishes the hashed values by their magnitude relative to one another within the same set. This is what the mash function is for: a magnitude hash.
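Both equivalence-class examples in the text can be checked directly; a small sketch using the same numbers (mapping value n = 10):

```python
# Under the regular hash h(x) = x mod 10, values sharing a residue collide:
same_hash = [1, 11, 21, 31, 1001, 10001]
print({x % 10 for x in same_hash})    # {1} -- one hash result for all

# Under the mash m(x) = x div 10, values sharing a magnitude collide:
same_mash = [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
print({x // 10 for x in same_mash})   # {3} -- one mash result for all

# Taken together, the (mash, hash) pair separates every value:
print([(x // 10, x % 10) for x in same_hash])
# [(0, 1), (1, 1), (2, 1), (3, 1), (100, 1), (1000, 1)]
```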
A mash function has the same form as a regular hash function, only it uses div rather than mod: (x div n) instead of (x mod n). The div operator applied to a value of the form c · x + r gives the value c, where x is the base (usually decimal, base 10). The mash function maps values into a set whose members all mash to the same result, based upon magnitude. So the mash function shares the same problem as a regular hash: all such values are mapped into an equivalence set. A set of the form {c1 · x + r1, c1 · x + r2, . . . , c1 · x + rn−1, c1 · x + rn} has all values mashed to the same result. With n = 10, c = 3, the following values are equivalent under the mash function: {30, 31, 32, 33, 34, 35, 36, 37, 38, 39}. With the mash function some partitioning of the values is obtained, but there is no distinction among the values within a set that is unique to each value.

Together, however, a hash function and a mash function can distinguish values both by magnitude and by residue. The minimal form of this association between the two functions is an ordinal pair (c, r), where c is the multiple of the base obtained with the mash function, and r is the remainder of the value obtained with the hash function. Each ordinal pair is a unique representation of the value, built from the magnitude constant and the residue. Further, the pairs distinguish larger and smaller values: magnitudes are separated by the mash result, and equal magnitudes are distinguished from each other by the residue. So the mapping is a perfect hash, since all values are uniquely mapped to a result, and it is ordering, since the magnitude of the values is preserved in the mapping. A formal proof of this property, an injective mapping from one set to a resulting set, is given. The values for n in the hash function and the mash function are determined by the range of values involved. The proof validates the injective nature of the mapping and the mathematical properties of the parameters used, but gives no general guidelines for determining the parameters; it only indicates that the same mapping value must be used in both the hash and mash functions. Multiple iterations of the mapping functions can be used, so multiple mapping values can be used to form a set of mapping pairs. The choice of mapping values depends on the number of dimensions to be mapped into, and on the range of values the mapping pairs can take on. Numerous, smaller mapping values will produce large mapping ordinates from the mash function and small values from the hash function. This would be a diverse mix of values, but it depends upon the use of the hash sort and what the user of the algorithm desires.

The organization of the data elements in the matrix can be row-major or column-major. The mapping of values into a row and column by the mash and hash functions determines which. A row-major mapping has the rows mapped by the mash function and the columns mapped by the hash function; a column-major mapping interchanges the mash and hash functions for the rows and columns respectively.
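A quick way to see both properties at once, with hypothetical values and n = 10 as the mapping value: the (mash, hash) pairs are pairwise distinct, and their lexicographic order agrees with the numeric order of the values, which is exactly a row-major placement.

```python
# The super-hash pair (x div n, x mod n) is injective and order-preserving
# (hypothetical values; n = 10 as the mapping value).
n = 10
values = [42, 7, 99, 13, 70]
pairs = [divmod(x, n) for x in values]   # (mash, hash) = (x div n, x mod n)

assert len(set(pairs)) == len(values)    # injective: no pair repeats
assert sorted(pairs) == [divmod(x, n) for x in sorted(values)]  # order kept

print(sorted(pairs))  # [(0, 7), (1, 3), (4, 2), (7, 0), (9, 9)]
```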
2.2. Construction of the Super-Hash Function.

2.2.1. Method of Construction. The super-hash function consists of a regular hash function using the mod operator and a mash function using the div operator. Together these form the ordinal pairs that uniquely map each data element into a square matrix. The important component of the super-hash function to determine is the mapping constant Θ, which must be calculated from the range of the values to be mapped. The range determines the dimensionality, or size, of the matrix, which is square for this super-hash function. Given a range R[i, j], where i is the lower bound and j is the upper bound:

1. Compute the length L of the range, where L = (j − i) + 1.
2. Determine the nearest square integer to L, calculated as Θ = ⌈√L⌉.

The final value computed, Θ, is the side of the nearest square to L, the length of the range of values. In essence, each value is reconstructed in the form of a number tailored to the length of the range:

value(dx, mx) = dx · Θ + mx

where dx is the mash (div) result, mx is the hash (mod) result, and Θ is determined by the range of the values to be mapped.

2.2.2. Example of Constructing a Super-Hash Function. As an example, suppose the range of values runs from 13 to 123. The length of the range is L = (123 − 13) + 1 = 111. The nearest square is then calculated as Θ = ⌈√111⌉, which evaluates as Θ = ⌈10.53565375 . . .⌉ = 11.
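Putting the pieces together, a minimal sketch of the construction described above (assuming, as the text requires, unique integers within a known range; function and variable names are illustrative): each value is placed into a Θ × Θ matrix by its (div, mod) pair, and a row-major traversal reads the data back in sorted order without any comparisons.

```python
import math

def hash_sort(values, lo, hi):
    """Place each unique integer in [lo, hi] into a Theta x Theta matrix
    via the super-hash pair (div, mod); row-major traversal is sorted."""
    length = (hi - lo) + 1                 # L = (j - i) + 1
    theta = math.ceil(math.sqrt(length))   # Theta = ceil(sqrt(L))
    matrix = [[None] * theta for _ in range(theta)]
    for x in values:
        d, m = divmod(x - lo, theta)       # mash -> row, hash -> column
        matrix[d][m] = x                   # one move per item: linear, no comparisons
    return [x for row in matrix for x in row if x is not None]

# The paper's example range, 13..123, gives Theta = 11:
print(hash_sort([123, 77, 13, 42, 99], 13, 123))  # [13, 42, 77, 99, 123]
```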
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Journal: CoRR
Volume: cs.DS/0408040
Pages: -
Publication date: 2004